ABSTRACT

A scattering transform defines a signal representation which is
invariant to translations and Lipschitz continuous relative to
deformations. It is implemented with a non-linear convolution
network that iterates over wavelet and modulus operators. Lipschitz
continuity locally linearizes deformations. Complex classes of
signals and textures can be modeled with low-dimensional affine
spaces, computed with a PCA in the scattering domain. Classification
is performed with a penalized model selection. State-of-the-art
results are obtained for handwritten digit recognition over small
training sets, and for texture classification.

Index Terms — Image classification, Invariant represen- 
tations, local image descriptors, pattern recognition, texture 
classification. 



1. INTRODUCTION

Affine space models are simple to compute with a Principal
Component Analysis (PCA) but are not appropriate to approximate
signal classes that include complex forms of variability. Image
classes are often invariant to rigid transformations such as
translations or rotations, and include elastic deformations, which
define highly non-linear manifolds. Textures may also be
realizations of strongly non-Gaussian processes that cannot be
discriminated with linear models either.

Kernel methods define distances $d(f,g) = \|\Phi(f) - \Phi(g)\|$,
with operators $\Phi$ which address these issues by mapping $f$
and $g$ into a space of much higher dimension. However, invariance
properties and learning requirements on small training sets rather
suggest implementing a dimensionality reduction.

Scattering operators constructed in [9, 10] are invariant to global
translations and Lipschitz continuous relative to local
deformations, up to a log term, thus providing local translation
invariance through the linearization of such deformations. These
scattering operators create invariance by averaging interference
coefficients, which capture signal interactions at several scales
and orientations. This paper models complex signal classes with
low-dimensional affine spaces in the scattering domain, which are
computed with a PCA. The classification is performed by a penalized
model selection.

Scattering operators may also be invariant to any compact Lie
subgroup of $GL(\mathbb{R}^2)$, such as rotations, but we
concentrate on translation invariance, which carries the main
difficulties and already covers a wide range of classification
applications. Section 2 reviews the construction of scattering
operators with a cascade of wavelet transforms and modulus
operators, which defines a non-linear convolution network [6].
Section 3 shows that learning affine scattering model spaces has a
linear complexity in the number of training samples. Section 4
describes state-of-the-art classification results obtained from a
limited number of training samples in the MNIST handwritten digit
database, and for texture classification in the CUREt database.
Software is available at www.cmap.polytechnique.fr/scattering.

2. SCATTERING OPERATORS

In order to build a representation which is locally translation
invariant, a scattering transform begins from a wavelet
representation. Translation invariance is obtained by progressively
mapping high frequency wavelet coefficients to lower frequencies,
with modulus operators described in Section 2.1. Scattering
operators, defined in Section 2.2, iterate over wavelet modulus
operators. Section 2.3 shows that this defines a translation
invariant representation, which is Lipschitz continuous to
deformations, up to a log term.

2.1. Wavelet Modulus Propagator 

A wavelet transform extracts information at different scales and
orientations by convolving a signal $f$ with dilated band-pass
wavelets $\psi_\gamma$ having a spatial orientation angle
$\gamma \in \Gamma$:

$$W_{j,\gamma} f(x) = f \star \psi_{j,\gamma}(x) \quad\text{with}\quad \psi_{j,\gamma}(x) = 2^{-2j}\,\psi_\gamma(2^{-j}x).$$

At the largest scale $2^J$, low frequencies are carried by a
low-pass scaling function $\phi$: $A_J f = f \star \phi_J$, with
$\phi_J(x) = 2^{-2J}\phi(2^{-J}x)$ and $\int \phi(x)\,dx = 1$. The
resulting wavelet representation is

$$W_J f = \{A_J f,\ W_{j,\gamma} f\}_{j<J,\,\gamma\in\Gamma}.$$



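The norm of the wavelet operator is defined by

$$\|W_J f\|^2 = \|f \star \phi_J\|^2 + \sum_{j<J,\,\gamma\in\Gamma} \|W_{j,\gamma} f\|^2 \tag{1}$$

with $\|f\|^2 = \int |f(x)|^2\,dx$ and it satisfies

$$(1-\delta)\,\|f\|^2 \le \|W_J f\|^2 \le \|f\|^2 \tag{2}$$

if and only if for all $\omega \in \mathbb{R}^2$,

$$1-\delta \le |\hat\phi(2^J\omega)|^2 + \frac{1}{2}\sum_{j<J,\,\gamma\in\Gamma}\Bigl(|\hat\psi_\gamma(2^j\omega)|^2 + |\hat\psi_\gamma(-2^j\omega)|^2\Bigr) \le 1. \tag{3}$$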

We consider families of complex wavelets

$$\psi_\gamma(x) = e^{i\xi_\gamma\cdot x}\,\theta_\gamma(x)$$

where $\theta_\gamma(x)$ are low-pass envelopes. Oriented Gabor
functions are examples of complex wavelets, obtained with a
modulated Gaussian $\psi(x) = e^{i\xi\cdot x}\,e^{-|x|^2/(2\sigma^2)}$,
which is rotated with $R_\gamma$ by an angle $\pi\gamma/|\Gamma|$:
$\psi_\gamma(x) = \psi(R_\gamma x)$. In numerical experiments, we
set $\xi = 3\pi/4$, $\sigma = 1$, $|\Gamma| = 6$, and $\phi$ is also
a Gaussian with $\sigma = 2/3$. It satisfies (3) only over a finite
range of scales.
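As an illustration of the Gabor family described above, the following NumPy sketch samples the six oriented wavelets on a small grid. The grid size, the function name `gabor_bank`, the choice of putting the modulation $\xi$ along the first axis before rotation, and the rotation direction are assumptions made for this example, not details given in the text.

```python
import numpy as np

def gabor_bank(size=32, xi=3 * np.pi / 4, sigma=1.0, n_angles=6):
    """Sample psi_gamma(x) = psi(R_gamma x) on a size x size grid.

    psi(x) = exp(i * xi * x_1) * exp(-|x|^2 / (2 sigma^2)); the modulation
    direction and the grid size are illustrative assumptions.
    """
    half = size // 2
    ys, xs = np.mgrid[-half:half, -half:half].astype(float)
    # Isotropic Gaussian envelope: unchanged by the rotation R_gamma
    envelope = np.exp(-(xs**2 + ys**2) / (2 * sigma**2))
    bank = []
    for gamma in range(n_angles):
        angle = np.pi * gamma / n_angles
        # First coordinate of the rotated point R_gamma x
        u = xs * np.cos(angle) + ys * np.sin(angle)
        bank.append(np.exp(1j * xi * u) * envelope)
    return np.stack(bank)

bank = gabor_bank()
# Location of the spectral peak of the first wavelet (row index, column index)
peak = np.unravel_index(np.argmax(np.abs(np.fft.fft2(bank[0]))), bank[0].shape)
```

With these settings the first wavelet is a band-pass filter whose spectrum peaks at frequency $\xi$ along the horizontal axis, which on a 32-sample grid is bin $32 \cdot \xi / (2\pi) = 12$.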

If $f_\tau(x) = f(x - \tau)$ then

$$W_{j,\gamma} f_\tau(x) = W_{j,\gamma} f(x - \tau) \approx W_{j,\gamma} f(x)$$

if and only if $|\tau| \ll 2^j$, because $W_{j,\gamma} f$ has
derivatives of amplitude proportional to $2^{-j}$. High frequencies
corresponding to fine scales are thus highly sensitive to
translations.

Translation invariance is improved by mapping high frequencies to
lower frequencies with a complex modulus operator. Since
$\psi_\gamma(x) = e^{i\xi_\gamma\cdot x}\,\theta_\gamma(x)$, we
verify that

$$W_{j,\gamma}f(x) = e^{i\xi_{j,\gamma}\cdot x}\,\bigl(f_{j,\gamma} \star \theta_{j,\gamma}(x)\bigr),$$

where $\xi_{j,\gamma} = 2^{-j}\xi_\gamma$,
$\theta_{j,\gamma}(x) = 2^{-2j}\theta_\gamma(2^{-j}x)$ and
$f_{j,\gamma}(x) = e^{-i\xi_{j,\gamma}\cdot x} f(x)$. Wavelet
coefficients $W_{j,\gamma}f(x)$ are located at high frequencies
because of the $e^{i\xi_{j,\gamma}\cdot x}$ term. These oscillations
are removed by a modulus operator

$$|W_{j,\gamma}f| = |f_{j,\gamma} \star \theta_{j,\gamma}(x)|. \tag{4}$$
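The factorization of the wavelet coefficients into a carrier and a demodulated envelope follows from a short computation. It is written out here as a sketch, using $\psi_{j,\gamma}(x) = e^{i\xi_{j,\gamma}\cdot x}\,\theta_{j,\gamma}(x)$ with $\xi_{j,\gamma} = 2^{-j}\xi_\gamma$, $\theta_{j,\gamma}(x) = 2^{-2j}\theta_\gamma(2^{-j}x)$ and $f_{j,\gamma}(x) = e^{-i\xi_{j,\gamma}\cdot x} f(x)$:

```latex
\begin{align*}
W_{j,\gamma} f(x)
  &= \int f(u)\,\psi_{j,\gamma}(x-u)\,du \\
  &= \int f(u)\,e^{i\xi_{j,\gamma}\cdot(x-u)}\,\theta_{j,\gamma}(x-u)\,du \\
  &= e^{i\xi_{j,\gamma}\cdot x}\int e^{-i\xi_{j,\gamma}\cdot u} f(u)\,\theta_{j,\gamma}(x-u)\,du \\
  &= e^{i\xi_{j,\gamma}\cdot x}\,\bigl(f_{j,\gamma}\star\theta_{j,\gamma}\bigr)(x).
\end{align*}
```

Taking the modulus removes the unit-modulus factor $e^{i\xi_{j,\gamma}\cdot x}$ and leaves $|f_{j,\gamma}\star\theta_{j,\gamma}(x)|$.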

The energy of $|W_{j,\gamma}f|$ is now mostly concentrated in the
low frequency domain covered by the envelope
$\hat\theta_{j,\gamma}(\omega) = \hat\theta_\gamma(2^j\omega)$. It
may however also include some high frequencies produced by the
modulus singularities where $f_{j,\gamma} \star \theta_{j,\gamma}(x) = 0$.
Using complex wavelets is important to reduce the number of such
singularities and thus concentrate the information at low
frequencies.

If $f(x) = \sum_n a_n \cos(\omega_n x)$ then one can verify that
$|W_{j,\gamma}f(x)| = c_{j,\gamma} + \epsilon_{j,\gamma}(x)$ where
$\epsilon_{j,\gamma}(x)$ is an interference term. It is a
combination of the $\cos((\omega_n - \omega_{n'})x)$, for all
$\omega_n$ and $\omega_{n'}$ in the support of
$\hat\psi_\gamma(2^j\omega)$. The modulus yields interferences that
depend upon frequency intervals, but it loses the exact frequency
locations $\omega_n$ in each octave.
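The interference effect can be checked numerically in one dimension. The sketch below is only an illustration under stated assumptions: the signal length, the two frequency bins 38 and 44, and the Gaussian band-pass filter centred on bin 40 are hypothetical values chosen so that both cosines fall inside the filter support.

```python
import numpy as np

n = 512
x = np.arange(n)
k1, k2 = 38, 44                       # integer frequency bins (hypothetical)
f = np.cos(2 * np.pi * k1 * x / n) + np.cos(2 * np.pi * k2 * x / n)

# Complex band-pass filter defined in Fourier: Gaussian bump centred on bin 40,
# negligible at negative frequencies
k = np.fft.fftfreq(n, d=1.0 / n)      # signed integer bins
psi_hat = np.exp(-((k - 40.0) ** 2) / (2 * 8.0**2))

w = np.fft.ifft(np.fft.fft(f) * psi_hat)   # complex wavelet coefficients
envelope = np.abs(w)                        # modulus removes the carrier

# Dominant oscillation of the envelope once its mean is removed
spectrum = np.abs(np.fft.fft(envelope - envelope.mean()))
beat_bin = int(np.argmax(spectrum[: n // 2]))
```

The envelope oscillates at the difference frequency `k2 - k1`, while the individual positions of the two bins inside the filter band are no longer visible in it.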



A wavelet modulus propagator is obtained by applying a complex
modulus to all wavelet coefficients:

$$U_J f = \{A_J f,\ |W_{j,\gamma} f|\}_{j<J,\,\gamma\in\Gamma}.$$

Since $\bigl||a| - |b|\bigr| \le |a - b|$ and the wavelet transform
is contractive, it results that

$$\|U_J f - U_J g\| \le \|W_J f - W_J g\| \le \|f - g\|$$

and $\|U_J f\| = \|f\|$ if $\delta = 0$ in (3).
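The contraction property can be illustrated with a small one-dimensional sketch. The filters below are Gaussian bumps defined directly in the Fourier domain, rescaled so that the sum of their squared moduli is at most 1 at every frequency, which implies the upper bound in (3). The function names, the signal length and this particular filter design are assumptions for the example, not the filters used in the paper.

```python
import numpy as np

def filter_bank(n, J, xi=3 * np.pi / 4, sigma_psi=1.0, sigma_phi=2.0 / 3.0):
    """Fourier-domain filters, rows = [phi_J, psi_0, ..., psi_{J-1}]."""
    omega = 2 * np.pi * np.fft.fftfreq(n)        # angular frequencies in [-pi, pi)
    phi_hat = np.exp(-0.5 * (sigma_phi * 2.0**J * omega) ** 2)
    psi_hat = [np.exp(-0.5 * (sigma_psi * (2.0**j * omega - xi)) ** 2)
               for j in range(J)]
    bank = np.array([phi_hat] + psi_hat)
    # Rescale so that the Littlewood-Paley sum is at most 1 at every frequency
    bank /= np.sqrt((bank**2).sum(axis=0).max())
    return bank

def propagate(f, bank):
    """U_J f: low-pass output A_J f followed by the moduli |W_j f|."""
    coeffs = np.fft.ifft(np.fft.fft(f) * bank, axis=1)
    return np.concatenate([coeffs[:1], np.abs(coeffs[1:])])

rng = np.random.default_rng(0)
f, g = rng.standard_normal(256), rng.standard_normal(256)
bank = filter_bank(256, J=4)

lhs = np.linalg.norm(propagate(f, bank) - propagate(g, bank))
rhs = np.linalg.norm(f - g)            # lhs <= rhs by contraction
delta = 1.0 - (bank**2).sum(axis=0).min()   # gap to the lower bound
```

For this crude filter bank `delta` is not close to 0, so the propagator is contractive but does not preserve the norm; norm preservation requires filters designed so that the sum in (3) equals 1.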

2.2. Multiple Paths Scattering 

Thanks to the concentration towards the low frequencies, the
interference coefficients of the wavelet modulus propagator can be
locally averaged by $\phi_J$ in order to produce locally translation
invariant coefficients with non-negligible energy:

$$|f \star \psi_{j,\gamma}| \star \phi_J(x).$$

The high frequencies of $|f \star \psi_{j_1,\gamma_1}|$ not removed
by the convolution with $\phi_J$ are carried by the wavelet
coefficients $|f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}$
at scales $2^{j_2} < 2^J$. To become insensitive to local
translation and reduce the variability of these coefficients, their
complex phase can also be removed by a modulus which is also
averaged by $\phi_J$:

$$\bigl||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\bigr| \star \phi_J(x).$$

These second order coefficients provide co-occurrence information
at two scales $2^{j_1}$, $2^{j_2}$ and in two directions $\gamma_1$
and $\gamma_2$. This can distinguish corners and junctions from
edges and characterize texture structures. Coefficients are only
calculated for $2^{j_2} > 2^{j_1}$ because one can show [10] that
if $\psi$ is an appropriate complex wavelet then
$|f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}$ is
negligible at scales $2^{j_2} < 2^{j_1}$.

The high frequencies lost by the filtering with $\phi_J$ can again
be restored with finer scale wavelet coefficients, which are
regularized by suppressing their phase with a modulus and by
averaging the result with $\phi_J$. Applying this procedure
iteratively $n$ times yields the following coefficients:

$$\bigl|\,\bigl||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\bigr| \cdots \star \psi_{j_n,\gamma_n}\bigr| \star \phi_J(x).$$

At any location $x$, they provide co-occurrence information between
any of the $|\Gamma|^n$ families of angles
$1 \le \gamma_1, \ldots, \gamma_n \le |\Gamma|$ and any of the
$\binom{J}{n}$ families of scales satisfying
$0 \le j_1 < \ldots < j_n < J$. They are called scattering
coefficients because they can be interpreted as interaction
coefficients between $f$ and the successive wavelets
$\psi_{j_1,\gamma_1}, \ldots, \psi_{j_n,\gamma_n}$.

A scattering operator considers all the scattering coefficients at
all scales and orientations up to a maximum co-occurrence order
$m$. It is indexed along a path variable
$p = \{(j_n,\gamma_n)\}_{n \le |p| \le m}$ which is a family of
wavelet indices. It computes $|p|$ wavelet convolutions and modulus
along the path

$$S_J(p)f = \bigl|\,\bigl||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\bigr| \cdots \star \psi_{j_{|p|},\gamma_{|p|}}\bigr| \star \phi_J$$

with $j_n < J$ and $\gamma_n \in \Gamma$. Its dimension is
$\sum_{n=0}^{m} |\Gamma|^n \binom{J}{n}$.
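The path count $\sum_{n=0}^{m} |\Gamma|^n \binom{J}{n}$ is easy to evaluate directly. The helper name `scattering_dimension` below is hypothetical; the function only restates the formula.

```python
from math import comb

def scattering_dimension(J, n_angles, m):
    """Number of paths of length at most m: sum_n |Gamma|^n * C(J, n)."""
    return sum(n_angles**n * comb(J, n) for n in range(m + 1))

# Example: J = 3 scales, 6 orientations, paths of length at most 2
n_paths = scattering_dimension(J=3, n_angles=6, m=2)   # 1 + 6*3 + 36*3 = 127
```

The three terms count the single zero-order path, the $|\Gamma| J$ first-order paths and the $|\Gamma|^2 \binom{J}{2}$ second-order paths.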

One can verify that scattering coefficients for paths of length
$m'$ are computed by applying the wavelet modulus propagator $U_J$
to scattering coefficients for all paths $p$ of length
$|p| = m' - 1$:

$$\{U_J S(p)f\}_{|p| = m'-1} = \{S_J(p)f\}_{|p| = m'-1} \cup \{S(p)f\}_{|p| = m'} \tag{5}$$

where $S(p)f = \bigl|\,\bigl||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\bigr| \cdots \star \psi_{j_{|p|},\gamma_{|p|}}\bigr|$.
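The layer-by-layer recursion can be sketched in one dimension as follows. This is a simplified illustration and not the released software: it uses a single orientation, Gaussian band-pass filters defined in Fourier, no subsampling, and replaces the averaging by $\phi_J$ with a global mean (the large-$J$ limit). All function names are hypothetical.

```python
import numpy as np

def wavelets_hat(n, J, xi=3 * np.pi / 4):
    """Fourier transforms of J dilated band-pass filters (illustrative design)."""
    omega = 2 * np.pi * np.fft.fftfreq(n)
    return [np.exp(-0.5 * (2.0**j * omega - xi) ** 2) for j in range(J)]

def scattering(f, J, m):
    """Return {path: averaged coefficient} for paths j1 < ... < jn with n <= m."""
    psi_hat = wavelets_hat(len(f), J)
    layer = {(): np.asarray(f, dtype=float)}    # S(p)f for |p| = 0
    out = {}
    for _ in range(m + 1):
        next_layer = {}
        for path, u in layer.items():
            out[path] = u.mean()                # stand-in for u * phi_J
            start = path[-1] + 1 if path else 0
            u_hat = np.fft.fft(u)
            for j in range(start, J):           # only coarser scales
                next_layer[path + (j,)] = np.abs(np.fft.ifft(u_hat * psi_hat[j]))
        layer = next_layer                      # propagate to the next layer
    return out

rng = np.random.default_rng(1)
f = rng.standard_normal(256)
S = scattering(f, J=4, m=2)
S_shifted = scattering(np.roll(f, 37), J=4, m=2)   # circularly translated input
```

Because the convolutions are circular and the final averaging is global, the coefficients of `f` and of its circular translation coincide up to rounding errors, and the number of paths is $1 + 4 + 6 = 11$ for `J=4`, `m=2`.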

A scattering operator is thus computed with a cascade of
convolutions and modulus operators over $m+1$ layers, similar to
the convolution network architecture introduced by LeCun [6]:



[Figure: scattering convolution network. The input $f$ is
propagated to the first layer $|f \star \psi_{j_1,\gamma_1}|$ and
to the second layer
$\bigl||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\bigr|$.]

After convolution with $\phi_J$ the output can be subsampled at
intervals $2^J$. If $f$ is an image of $N$ pixels, this uniform
sampling yields a scattering representation $S_J f$ including a
total of $N_J = 2^{-2J} N \sum_{n=0}^{m} |\Gamma|^n \binom{J}{n}$
coefficients. The output of any wavelet convolution and modulus
$|\cdots \star \psi_{j,\gamma}|$ can be subsampled at intervals
$2^{j-1}$, which reduces intermediate computations and barely
introduces any aliasing. With an FFT, the overall computational
complexity is then $O(N \log N)$.

2.3. Scattering Metric and Deformation Stability

For appropriate complex wavelets, one can prove [1] that the energy
$\sum_{|p|=m} \|S_J(p)f\|^2$ of a scattering layer $m$ tends to
zero as $m$ increases. This decay is fast. Numerically the maximum
network depth is typically limited to $m_0 = 3$.

The scattering metric is obtained with a summation over all paths
$p$:

$$\|S_J f - S_J g\|^2 = \sum_p \|S_J(p)f - S_J(p)g\|^2$$

where $\|S_J(p)f\|^2 = \int |S_J(p)f(x)|^2\,dx$. Since $S_J$ is
calculated by iterating on the contractive propagator $U_J$ (5), it
results that it is also contractive [8]:

$$\|S_J f - S_J g\|^2 \le \|f - g\|^2.$$

Scattering operators are not only contractive but also preserve the
norm. For appropriate complex wavelets which satisfy (3) for
$\delta = 0$, one can prove [10] that $\|S_J f\| = \|f\|$.

When a signal is translated, $f_\tau(x) = f(x - \tau)$, the
scattering transform is also translated

$$S_J(p)f_\tau(x) = S_J(p)f(x - \tau)$$

because it is computed with convolutions and modulus. However, when
$J$ increases, $S_J(p)f(x)$ tends to a constant because of the
convolutions with $\phi_J$. It thus becomes translation invariant
and one can verify [10] that the asymptotic scattering metric is
translation invariant:

$$\lim_{J\to\infty} \|S_J f - S_J f_\tau\| = 0.$$

For classification the key scattering property is its Lipschitz
continuity to deformations $D_\tau f(x) = f(x - \tau(x))$. Let
$|\tau|_\infty = \sup_x |\tau(x)|$ and
$|\nabla\tau|_\infty = \sup_x |\nabla\tau(x)| < 1$, where
$|\nabla\tau(x)|$ is the matrix sup norm of $\nabla\tau(x)$. Along
paths of length $|p| \le m_0$, one can prove [10] that for all
$2^J \ge |\tau|_\infty / |\nabla\tau|_\infty$ the scattering metric
satisfies

$$\|S_J D_\tau f - S_J f\| \le C\, m_0\, \|f\|\, |\nabla\tau|_\infty \log\frac{|\tau|_\infty}{|\nabla\tau|_\infty}. \tag{6}$$

The scattering operator is thus Lipschitz continuous to
deformations, up to a log term. It shows that for sufficiently
large scales $2^J$, the signal translations and deformations are
locally linearized by the scattering operator.

3. CLASSIFICATION 

Local translation invariance and Lipschitz regularity to local 
deformations linearize small deformations. Signal classes can 
thus be approximated with low-dimensional affine spaces in 
the scattering domain. Although the scattering representation 
is implemented with a potentially deep convolution network, 
learning is not deep and it is reduced to PCA computations. 
The classification is implemented with a penalized model se- 
lection. 

3.1. Affine Scattering Space Models 

A signal class $C$ can be modeled as a realization of a random
process $F$. There are multiple sources of variability, due to the
reflectivity of the material as in textures, to deformations, or to
various illuminations. Illumination variability is often
low-frequency and can be approximated in linear spaces of dimension
close to 10 [1]. This property remains valid in the scattering
domain. A scattering operator also linearizes local deformations
and reduces the variance of large classes of stationary processes.
One can thus build a linear affine space approximation of $S_J F$.
A scattering transform $S_J F$ along progressive paths of length
$|p| \le m_0$ is a vector of size $O(N)$, which may be much smaller
than $N$ if $J$ is large.



The affine space of dimension $k$ which minimizes the expected
projection error $E\{\|S_J F - P_{\mathbf{A}_k}(S_J F)\|^2\}$ is

$$\mathbf{A}_k = \mu_J + \mathbf{V}_k \tag{7}$$

where $\mu_J(p,x) = E\{S_J(p)F(x)\}$ and $\mathbf{V}_k$ is the
space generated by the first $k$ eigenvectors of the covariance
operator of $S_J(p)F(x)$. The space dimension $k$ is limited to a
maximum value $K$.

These affine space models are estimated by computing the empirical
average and the empirical covariance of $S_J(p)f(x)$, for all
training signals $f \in C$. The empirical covariance is
diagonalized to estimate the $K$ eigenvectors of largest
eigenvalues. Under mild conditions [14], the sample covariance
matrix $\Sigma$ converges in norm to the true covariance when the
number of training signals is of the order of the dimensionality of
the space where $S_J F$ belongs. Dimensionality reduction is thus
important to learn affine space models from few training signals.

The computational complexity to estimate affine space models
$\mathbf{A}_k$ is dominated by eigenvector calculations. To compute
the first $K$ eigenvectors, a thin SVD algorithm requires
$O(T K N)$ operations, where $T$ is the number of training signals.
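The estimation of one affine model can be sketched with NumPy as follows. The function name `fit_affine_model` and the array layout (one scattering vector per row) are assumptions. For brevity the sketch calls NumPy's full thin SVD, so it does not reach the $O(TKN)$ cost quoted above; a truncated SVD solver would be needed for that.

```python
import numpy as np

def fit_affine_model(X, K):
    """Estimate mu and the K leading covariance eigenvectors for one class.

    X has shape (T, N): T training scattering vectors of dimension N.
    Returns (mu, E, eigenvalues) with E of shape (K, N), rows orthonormal.
    """
    mu = X.mean(axis=0)                      # empirical average
    # Thin SVD of the centred data: rows of Vt are the eigenvectors of the
    # empirical covariance (X - mu)^T (X - mu) / T, sorted by decreasing eigenvalue
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:K], s[:K] ** 2 / len(X)
```

Working on the centred data matrix avoids forming the $N \times N$ covariance matrix, which matters when $N$ is much larger than $T$.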

3.2. Linear Model Selection 

Let us consider a classification problem with several classes
$\{C_i\}_{1 \le i \le I}$. We introduce a classification algorithm
which selects affine space models by minimizing a penalized
approximation error.

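Each class $C_i$ is represented by a family of embedded affine
spaces $\mathbf{A}_{k,i} = \mu_i + \mathbf{V}_{k,i}$ where
$\mathbf{V}_{k,i}$ is the space generated by the first $k$
eigenvectors $\{e_{l,i}\}_{l \le k}$ of the empirical covariance
matrix $\Sigma_i$. For a fixed dimension $k$, a space
$\mathbf{A}_{k,i}$ is discriminative for $f \in C_i$ if the
projection error of $S_J f$ in $\mathbf{A}_{k,i}$ is smaller than
its projection error in the other spaces $\mathbf{A}_{k,i'}$:

$$\forall i' \ne i, \quad \|S_J f - P_{\mathbf{A}_{k,i'}}(S_J f)\|^2 > \|S_J f - P_{\mathbf{A}_{k,i}}(S_J f)\|^2,$$

with

$$\|S_J f - P_{\mathbf{A}_{k,i}}(S_J f)\|^2 = \|S_J f - \mu_i\|^2 - \sum_{l=1}^{k} |\langle S_J f - \mu_i,\, e_{l,i}\rangle|^2.$$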
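The projection error formula gives all the errors for $k = 0, \ldots, K$ from a single set of inner products. The sketch below assumes the eigenvectors are stored as orthonormal rows of an array `E`; the function name is hypothetical, and including $k = 0$ (distance to the centroid alone) is a choice made for this example.

```python
import numpy as np

def projection_errors(s, mu, E):
    """Squared distances from s to A_k = mu + span(E[:k]) for k = 0..K.

    s: vector of dimension N, mu: class centroid, E: (K, N) orthonormal rows.
    """
    r = s - mu
    c = E @ r                                   # inner products <s - mu, e_l>
    # ||r||^2 minus the cumulative energy captured by the first k eigenvectors
    return r @ r - np.concatenate([[0.0], np.cumsum(c**2)])
```

The returned sequence is non-increasing in $k$, and each entry equals the squared norm of the residual obtained by projecting explicitly on the corresponding affine space.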

Model selection for classification is not about finding an accurate
approximation model as in model selection for regression, but looks
for a discriminative model [2]. If $S_J f$ for $f \in C_i$ is close
to the class centroid $\mu_i$ then low-dimensional affine spaces
$\mathbf{A}_{k,i}$ are highly discriminative even if the remaining
error is not negligible, because it is unlikely that any other
low-dimensional affine space $\mathbf{A}_{k,i'}$ yields a
comparable error. If $f$ is an "outlier" which is far from the
centroid $\mu_i$ then a higher dimensional approximation space
$\mathbf{A}_{k,i}$ is needed for discrimination. One can then
adjust the dimensionality of the discrimination space to each
signal $f$ by penalizing the dimension of the approximation space.
The class index $i$ of $f$ is estimated by adjusting the dimension
$k$ of the space $\mathbf{A}_{k,i}$ that yields the best
approximation, with a penalization proportional to the space
dimension $k$ [2]:

$$\hat{i}(f) = \arg\min_{i \le I}\ \min_{k \le K}\ \|S_J f - P_{\mathbf{A}_{k,i}}(S_J f)\|^2 + \beta k.$$
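The penalized decision rule can be sketched in a few lines. The function name `classify`, the representation of each class model as a pair `(mu, E)` with orthonormal rows in `E`, and the inclusion of $k = 0$ in the minimization are assumptions for the example.

```python
import numpy as np

def classify(s, models, beta):
    """Return argmin_i min_k ||s - P_{A_{k,i}} s||^2 + beta * k.

    models: list of (mu_i, E_i), with E_i of shape (K, N) and orthonormal rows.
    """
    costs = []
    for mu, E in models:
        r = s - mu
        # Projection errors for k = 0..K, computed from the inner products
        errs = r @ r - np.concatenate([[0.0], np.cumsum((E @ r) ** 2)])
        costs.append(np.min(errs + beta * np.arange(len(errs))))
    return int(np.argmin(costs))

# Toy example in dimension 6 with two classes (hypothetical data)
I6 = np.eye(6)
models = [(np.zeros(6), I6[:2]), (5 * np.ones(6), I6[2:4])]
label_a = classify(3 * I6[0], models, beta=0.5)
label_b = classify(5 * np.ones(6) + 2 * I6[3], models, beta=0.5)
```

In the toy example the first vector lies in the affine space of class 0 and the second in the affine space of class 1, so the rule returns 0 and 1 respectively.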

This classification algorithm depends upon the penalization factor
$\beta$ and the scale $2^J$ of the scattering transform. These two
parameters are optimized with a cross-validation mechanism. It
minimizes a classification error computed on a validation subset of
the training samples, which does not take part in the affine model
learning.

• Increasing the scale $2^J$ reduces the intra-class variability
of the representation by building invariance, but it can also
reduce the distance across classes. The optimal scale $2^J$ is
thus a trade-off between both.

• The penalization parameter $\beta$ is similar to a threshold on
$|\langle S_J F - \mu_i,\, e_{k,i}\rangle|^2$. The model increases
the dimension $k$ of the approximation space if this squared inner
product is above $\beta$. Increasing $\beta$ thus reduces the
dimension of the affine model spaces, which is needed when the
training sequence is small.
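The threshold interpretation of $\beta$ can be checked on a small example. Adding the $k$-th eigenvector lowers the approximation error by $|\langle S_J f - \mu_i, e_{k,i}\rangle|^2$ and raises the penalty by $\beta$, so it is worthwhile only when the squared inner product exceeds $\beta$. The numerical values below are hypothetical and chosen to be decreasing in $k$; the function name is an assumption.

```python
import numpy as np

def selected_dimension(sq_coeffs, residual0, beta):
    """argmin over k = 0..K of residual0 - sum_{l<=k} sq_coeffs[l] + beta * k."""
    errs = residual0 - np.concatenate([[0.0], np.cumsum(sq_coeffs)])
    return int(np.argmin(errs + beta * np.arange(len(errs))))

# Hypothetical squared inner products, decreasing with the eigenvector index
sq = np.array([9.0, 4.0, 1.0, 0.25])
betas = (0.1, 0.5, 2.0, 5.0, 10.0)
dims = [selected_dimension(sq, residual0=20.0, beta=b) for b in betas]
# dims == [4, 3, 2, 1, 0]: the number of squared coefficients above beta
```

When the squared coefficients decrease with $k$, as here, the selected dimension is exactly the number of coefficients above the threshold; for a non-monotone sequence the argmin is still well defined but no longer reduces to a simple count.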

4. CLASSIFICATION RESULTS AND ANALYSIS 

This section presents classification results for handwritten digit
recognition, and for texture discrimination with illumination
variations. The scattering transform is implemented with the same
Gabor wavelets along $|\Gamma| = 6$ orientations for both problems,
and the maximum scattering length is limited to $m_0 = 2$.

4.1. Handwritten Digit Recognition 

The MNIST handwritten digit database provides a good example of
classification with important deformations. Table 1 compares
scattering classification results for training sets of variable
size with results obtained with deep-learning convolutional
networks [12], which currently have the best results. The table
includes the PCA model selection algorithm applied on scattering
coefficients and an SVM classifier with a polynomial kernel whose
degree was optimized, also applied on scattering coefficients.
Cross-validation finds an optimal scattering scale $J = 3$, which
corresponds to translations and deformations of amplitude about
$2^J = 8$ pixels, which is compatible with observed deformations on
digits.

Below $5 \cdot 10^3$ training examples, a PCA scattering classifier
provides state-of-the-art results. It yields smaller errors than
deep-learning convolution networks, which require large training
sets to optimize all network parameters with back-propagation
algorithms. For $60 \cdot 10^3$ training samples, the deep-learning
convolution network error [5] is below the scattering classifier
error. Table 1 shows that applying a linear SVM classifier over the
scattering transform degrades the results relative to a PCA
classifier up to large training sets, and it requires much more
computation. This is an indirect validation of the linearization
properties of the scattering transform.

Table 1. Percentage of error as a function of the training size
for MNIST. Minimum errors are in bold. The last column
gives the average model space dimension k.

Training   ConvNets [12]   Scatt+SVM   Scatt+PCA
300        7.18            21.5        5.93
1000       3.21            3.06        2.38
2000       2.53            1.87        1.76
5000       1.52            1.54        1.27
10000      0.85            1.15        1.2
20000      0.76            0.92        0.74

Fig. 1. Relative Intra-class In and average Outer-class Out
approximation error for the digits i = 1 and i = 4.

Figure 1 shows the relative approximation error when approximating
a signal class with an affine model in the scattering domain. For
digits $i = 1$ and $i = 4$, it gives the average intra-class
approximation error of $S_J F_i$ with a space $\mathbf{A}_{k,i}$ of
the same class, as a function of $k$:

$$\mathrm{In}(i) = \frac{E\{\|S_J F_i - P_{\mathbf{A}_{k,i}}(S_J F_i)\|^2\}}{E\{\|S_J F_i\|^2\}}.$$

It is compared with

$$\mathrm{Out}(i) = \frac{E\{\|S_J F_{i'} - P_{\mathbf{A}_{k,i}}(S_J F_{i'})\|^2 \mid i \ne i'\}}{E\{\|S_J F_{i'}\|^2 \mid i \ne i'\}},$$

which is the average outer-class approximation error produced by
the spaces $\mathbf{A}_{k,i}$ over all samples $S_J F_{i'}$
belonging to different classes $i' \ne i$. The intra-class error
decay is much faster than the outer-class error decay for
$k < 10$, which shows the discrimination ability of low-dimensional
affine spaces. For $k > 10$, the intra-class versus outer-class
distance ratio In/Out is approximately $10^{-2}$ and $10^{-1}$
respectively for the digits $i = 1$ and $i = 4$. It shows the
discrimination power of these affine models, and the much larger
intra-class variability for handwritten digits 4 than for
handwritten digits 1.

The US Postal Service set is another handwritten digit dataset,
with 7291 training samples and 2007 test images of 16 x 16 pixels.
The state of the art is obtained with tangent distance kernels [4].
Table 2 gives results with a PCA model selection on scattering
coefficients and a polynomial kernel SVM classifier applied to
scattering coefficients. The scattering scale was also set to
$J = 3$ by cross-validation.

Table 2. Error rate for the whole USPS database.

Scatt+PCA   Scatt+SVM   Tangent kern. [4]   humans
2.64        2.64        2.4                 2.37



4.2. Texture Classification: CUREt

The CUREt texture database includes 61 classes of image textures
of $N = 200^2$ pixels, with 46 training samples and 46 testing
samples in each class. Each texture class gives images of the same
material with different pose and illumination conditions.
Specularities, shadowing and surface normal variations make it
challenging for classification. Figure 2 illustrates the large
intra-class variability, and also shows that the variability across
classes is not always important.




Fig. 2. Top row: images of the same texture material with 
different poses and illuminations. Bottom row: examples of 
textures that are in different classes despite their similarities. 

Classification algorithms with optimized textons have an error rate
of 5.35% [7] over this database, and the best result of 2.57% error
rate was obtained in [13] with an optimized Markov Random Field
model.

Wavelets have been shown to provide useful models for texture
analysis. Scattering classification results are shown in Table 3
with exactly the same algorithm as for digit classification. With a
PCA it greatly improves existing results with an error rate of
0.2%. The SVM classifier with an optimized polynomial kernel on
scattering coefficients achieves a larger error rate of 1.71%.

Table 3. Error rate (%) for the CUREt database.

Scatt+PCA     Scatt+SVM   Textons [7]   MRFs [13]
0.2 ± 0.08    1.71        5.35          2.57

The cross-validation adjusts the scattering scale to $2^J = 2^7$,
which is the maximum value. Indeed, these textures are fully
stationary and increasing the scale reduces the variability of the
scattering coefficients across realizations. Scattering vectors
$S_J f$ at large scales $2^J$ have a small stochastic variability
within each texture class because of the averaging by $\phi_J$.
Moreover, global invariance to rotation and illumination changes is
provided by the PCA classification algorithm. These invariant
linear space models are learnt effectively even with few training
samples. This example shows that linear models are a simple yet
powerful mechanism to generate invariance for classification
problems.

5. CONCLUSION 

As a result of their translation invariance and Lipschitz regu- 
larity to deformations, scattering operators provide appropri- 
ate representations to model complex signal classes with 
affine spaces calculated with a PCA. Classification with 
model selection provides state of the art results with lim- 
ited training size sequences, for handwritten digit recognition 
and textures. As opposed to discriminative classifiers such 
as SVM and deep-learning convolution networks, these algo- 
rithms learn a model for each class independently from the 
others, which leads to fast learning algorithms. 

Scattering operators can be defined on more general Lie groups
other than the group of translations, such as the group of
rotations or scaling [10]. The intra-class variability due to the
action of several transformation groups can be contracted by
combining scattering operators adapted to each of these groups
[10]. On signal classes including clutter and more complex
variability, one can estimate the deformation group responsible for
most of the intra-class variability, provided one has enough
training samples.